NYC Taxi Exploratory Data Analysis

Introduction to the NYC Taxi Data

The data comes from the NYC Taxi and Limousine Commission and includes pickup time, geo-coordinates, number of passengers, and several other variables.

The data used here comes from Kaggle and New York taxi records. Each year, taxi companies store a large amount of customer information for analysis, to predict trends and generate insight.

The yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

The data processing stages, including the important steps, are listed in detail below:

Fare Amount Analysis: Analyze the revenue from each trip based on the features Fare_Amount and Trip_Duration.

Spatial Data Analysis: Analyze the density of each area on a heatmap combined with spatial data.

Trip Duration Analysis: Based on pick-up and drop-off locations and trip distances, I give a specific view of New York taxi user behavior over time (hour, day, month, year), for example in terms of fare_amount and trip duration.

I invested a lot of time and effort in this project as a Data Analyst, and I learned from many different resources to make the EDA as clean as possible, because I will upload it to GitHub and share it with everyone.


EXPLORATORY DATA ANALYSIS

Import Data

I split the same data into three parts, including:

Df_train5k, which holds 5,000 rows,

train, which holds 500,000 rows.

Each dataset name is used in a different graph because the full dataset is heavy and takes a long time to load, so I found a way to reduce it. With a more powerful CPU, I could load the extensive data with more than 1 million rows.

id vendor_id pickup_datetime dropoff_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag trip_duration
0 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1 -73.982155 40.767937 -73.964630 40.765602 N 455
1 id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1 -73.980415 40.738564 -73.999481 40.731152 N 663
2 id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1 -73.979027 40.763939 -74.005333 40.710087 N 2124
3 id3504673 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1 -74.010040 40.719971 -74.012268 40.706718 N 429
4 id2181028 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1 -73.973053 40.793209 -73.972923 40.782520 N 435
5 id0801584 2 2016-01-30 22:01:40 2016-01-30 22:09:03 6 -73.982857 40.742195 -73.992081 40.749184 N 443
6 id1813257 1 2016-06-17 22:34:59 2016-06-17 22:40:40 4 -73.969017 40.757839 -73.957405 40.765896 N 341
7 id1324603 2 2016-05-21 07:54:58 2016-05-21 08:20:49 1 -73.969276 40.797779 -73.922470 40.760559 N 1551
8 id1301050 1 2016-05-27 23:12:23 2016-05-27 23:16:38 1 -73.999481 40.738400 -73.985786 40.732815 N 255
9 id0012891 2 2016-03-10 21:45:01 2016-03-10 22:05:26 1 -73.981049 40.744339 -73.973000 40.789989 N 1225

Data Processing:

Anomalies in trip duration, %: 0.74
Trip duration in seconds: 60 to 7191
Empty trips: 17

Function

Missing Value

Your selected dataframe has 11 columns.
There are 0 columns that have missing values.
Missing Values % of Total Values
Your selected dataframe has 8 columns.
There are 0 columns that have missing values.
Missing Values % of Total Values

Result Check: After cross-checking for missing values, the data is entirely clean, without any missing values, and can be trusted. With big data such as the data for the center of New York City, we have to ensure that there are as few null values as possible so that the next step works well, because we will work with spatial data.
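The missing-value check above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual function: the name `missing_values_table` and the toy rows (including the deliberately missing fare) are assumptions made for the example.

```python
import pandas as pd


def missing_values_table(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: list the columns that contain missing values."""
    missing = df.isnull().sum()
    percent = 100 * missing / len(df)
    table = pd.concat([missing, percent], axis=1)
    table.columns = ["Missing Values", "% of Total Values"]
    # Keep only columns that actually have missing values, largest share first.
    table = (
        table[table["Missing Values"] > 0]
        .sort_values("% of Total Values", ascending=False)
        .round(1)
    )
    print(f"Your selected dataframe has {df.shape[1]} columns.")
    print(f"There are {table.shape[0]} columns that have missing values.")
    return table


# Toy data, made up for illustration: one fare is deliberately missing.
toy = pd.DataFrame(
    {"fare_amount": [4.5, None, 5.7, 7.7], "passenger_count": [1, 1, 2, 1]}
)
result = missing_values_table(toy)
```

On the real data the returned table is empty, which matches the "0 columns" lines shown above.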

key fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
0 2009-06-15 17:26:21.0000001 4.5 2009-06-15 17:26:21 UTC -73.844311 40.721319 -73.841610 40.712278 1
1 2010-01-05 16:52:16.0000002 16.9 2010-01-05 16:52:16 UTC -74.016048 40.711303 -73.979268 40.782004 1
2 2011-08-18 00:35:00.00000049 5.7 2011-08-18 00:35:00 UTC -73.982738 40.761270 -73.991242 40.750562 2
3 2012-04-21 04:30:42.0000001 7.7 2012-04-21 04:30:42 UTC -73.987130 40.733143 -73.991567 40.758092 1
4 2010-03-09 07:51:00.000000135 5.3 2010-03-09 07:51:00 UTC -73.968095 40.768008 -73.956655 40.783762 1
... ... ... ... ... ... ... ... ...
49995 2013-06-12 23:25:15.0000004 15.0 2013-06-12 23:25:15 UTC -73.999973 40.748531 -74.016899 40.705993 1
49996 2015-06-22 17:19:18.0000007 7.5 2015-06-22 17:19:18 UTC -73.984756 40.768211 -73.987366 40.760597 1
49997 2011-01-30 04:53:00.00000063 6.9 2011-01-30 04:53:00 UTC -74.002698 40.739428 -73.998108 40.759483 1
49998 2012-11-06 07:09:00.00000069 4.5 2012-11-06 07:09:00 UTC -73.946062 40.777567 -73.953450 40.779687 2
49999 2010-01-13 08:13:14.0000007 10.9 2010-01-13 08:13:14 UTC -73.932603 40.763805 -73.932603 40.763805 1

50000 rows × 8 columns

Feature Engineering

id vendor_id pickup_datetime dropoff_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag trip_duration
0 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1 -73.982155 40.767937 -73.96463 40.765602 N 455

Correlation map of features


Positive (red): Both variables move in the same direction; when one increases, the other increases.

Negative (light): The opposite happens.

The image shows the correlation between the features. Some things to note: the correlation involving passenger_count is also positive, and dropoff_latitude and pickup_latitude are red, which means they are positively correlated with each other, at about 0.30. If either one increases, the other tends to increase as well.

A light color means a negative correlation between features. That is, when one variable's value increases, the other variable's value decreases, as seen for vendor_id and pickup_latitude.
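As a rough sketch of how such a correlation map can be prepared, the snippet below builds a correlation matrix and an upper-triangle mask with the builtin `bool` (rather than the deprecated `np.bool`). The data is synthetic and made up for illustration, and the commented `sns.heatmap` call is only an assumption about how the plot might be drawn.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the trip table (made-up values, illustration only).
rng = np.random.default_rng(0)
lat = rng.normal(40.75, 0.03, 200)
toy = pd.DataFrame(
    {
        "pickup_latitude": lat,
        "dropoff_latitude": lat + rng.normal(0, 0.03, 200),
        "passenger_count": rng.integers(1, 7, 200),
    }
)

corr = toy.corr()

# Mask the upper triangle (diagonal included) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool))
lower = corr.mask(mask)

# Assumed plotting call, requires seaborn:
# sns.heatmap(corr, mask=mask, cmap="coolwarm", annot=True)
```

Positive entries in `lower` correspond to the red cells described above, negative entries to the light ones.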

Visualization

[Plot output; y-axis: No of Trips made]

This plot shows the pickup distribution by day of the week. Friday seems to be the most popular day to hail a taxi, with close to 220,000 trips, while Sunday is at the other end of the spectrum with approximately 165,000 trips.

[Plot output; y-axis: No of Trips made]

Weekdays are fairly stable. The evening, from about 19:00 to 23:00, is the densest period, with more than 12,000 pickups. Wednesday and Thursday have many departures from the pick-up points, and Sunday has few.

Between 1 am and 5 am there are fewer departures. From 9 am to noon, the number of pickups is also higher than 800.
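The weekday and hourly counts discussed above can be derived from pickup_datetime along these lines. It is a small sketch using the first five timestamps from the sample table; the variable names are assumptions.

```python
import pandas as pd

# First five pickup times from the sample table shown earlier.
toy = pd.DataFrame(
    {
        "pickup_datetime": [
            "2016-03-14 17:24:55",
            "2016-06-12 00:43:35",
            "2016-01-19 11:35:24",
            "2016-04-06 19:32:31",
            "2016-03-26 13:30:55",
        ]
    }
)
toy["pickup_datetime"] = pd.to_datetime(toy["pickup_datetime"])

# Derive the calendar features used for the bar plots.
toy["weekday"] = toy["pickup_datetime"].dt.day_name()
toy["hour"] = toy["pickup_datetime"].dt.hour

trips_per_weekday = toy["weekday"].value_counts()
trips_per_hour = toy.groupby("hour").size()
```

Plotting `trips_per_weekday` and `trips_per_hour` as bars gives charts of the kind described in this section.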


Fare Amount

Reasonable amounts range from about 0 to 200 dollars. This is the amount that customers spend on each trip. The chart below helps us clearly see the trend and level of spending at specific hours.

It shows the spending for each hour, based on the density of trips.


As you can see, the density peaks above 0.10 in the fairly affordable range of $0 to $30. Around $40 to $50 there is some density, from 0 to 0.01. Very few trips exceed $100; most of those go out to the city's outskirts.

Areas with dense coverage and low cost are mostly short trips through the center of New York.

Areas with low density but high cost are mostly at the airport, where customers shuttle back and forth between the center and the airport.

Detail of Fare Amount

I use the time_slicer function to draw several graphs at once, because users look at different time periods such as week, month, and weekday.

Following the plots above, this graph details the fare_amount by month, year, and hour of the day.

Month: there was already a monthly graph earlier; I repeat it here as a check.

Year: The amount increased gradually from 2009 to a peak in 2014, which is notable; fares kept getting higher over time.

Day: Among weekdays, rates are highest on Tuesdays and lowest on Thursdays; weekend rates are also high.

Hours: Rates spike early in the morning, at 5 AM.
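The implementation of time_slicer is not shown here, so the following is only a sketch of the underlying idea: averaging fare_amount per calendar unit. The helper name `mean_fare_by` is hypothetical, and the rows are the first five from the fare sample table.

```python
import pandas as pd


def mean_fare_by(df: pd.DataFrame, unit: str) -> pd.Series:
    """Hypothetical helper: average fare_amount per calendar unit."""
    extract = {
        "year": lambda s: s.dt.year,
        "month": lambda s: s.dt.month,
        "weekday": lambda s: s.dt.dayofweek,  # Monday = 0
        "hour": lambda s: s.dt.hour,
    }
    key = extract[unit](df["pickup_datetime"]).rename(unit)
    return df.groupby(key)["fare_amount"].mean()


# First five rows of the fare sample table shown earlier.
toy = pd.DataFrame(
    {
        "pickup_datetime": pd.to_datetime(
            [
                "2009-06-15 17:26:21",
                "2010-01-05 16:52:16",
                "2011-08-18 00:35:00",
                "2012-04-21 04:30:42",
                "2010-03-09 07:51:00",
            ]
        ),
        "fare_amount": [4.5, 16.9, 5.7, 7.7, 5.3],
    }
)

by_year = mean_fare_by(toy, "year")
by_hour = mean_fare_by(toy, "hour")
```

Calling the helper once per unit and plotting each result side by side would give the month, year, day, and hour panels described above.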

Intermediate Visualization

Looking at the heatmap for Saturday and Sunday, few pick-ups take place in the early morning from 5 am to 9 am. During weekdays, taxis start picking up passengers at about 7-8 am and continue into the afternoon. The busiest period, shown in bright yellow, is the evening from 16:00 to 20:00, with a density of more than 0.05 on the heatmap.

On Sunday we can also see customers taking taxis around midnight, also with a bright density from 0.05. This is from midnight to 2-3 am.

Going further, we will take a closer look at the trip distribution using the heatmap.

Understand the trip cost / fare_amount

Analysis:

We can see the fare in dollars by hour over the 24 hours of the day in different years. Around 5 AM there is a very high value. It also rises a bit after 3 PM. The morning is observed as the rush hour.

2009: The cheapest year; the highest price in that year is just over 12 dollars. Around midday, between 10:00 a.m. and 3:00 p.m. on the x-axis, it is also quite affordable, from less than 10 dollars up to 12 dollars.

2010: The trip fare amount across hours 0 to 24 increases by a small amount. The average cost is neither too high nor too low. The morning is still the busiest time, with the highest price that customers pay, when the working day starts in an energetic city like New York.

2011: Prices stay at a normal level. Rates are stable and the most affordable for New York residents compared to other years.

2012: Costs change abruptly and abnormally. The highest fares suddenly jump from under 14 to over 18 dollars at the morning peak at 5 am.

2013: At around 5 AM, costs at their highest point seem to have reached an unusual peak compared to previous years. Then there is a rush-hour wave around 16:00, which may be when office workers go home. The afternoon is otherwise stable. Fares reach over 16 dollars.

Log Trip Duration:

trip_duration is a skewed variable with a long tail. The graph below therefore shows the log-transformed distribution, which shortens the tail and makes the distribution more balanced.


Result of the log normalization of trip duration: The log of trip_duration is approximately normally distributed, and we can also identify peaks. If it is used as the response variable in a numerical model, we need to remember to convert the predicted log trip time back to its original scale.

There is a small outlier on the far right, but its impact is small. The maximum value on the log1p (NumPy) scale corresponds to 3.526282 × 10^6 seconds, which is equivalent to about 41 days. Possibly there are times when the driver left the meter's data recording open, and times when the driver did not open it to charge.
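The log transform and its inverse can be illustrated with NumPy's `log1p` and `expm1`. This is a minimal sketch: the first five durations are taken from the sample table, the last one is the maximum quoted above, and the variable names are assumptions.

```python
import numpy as np

# Trip durations in seconds: five sample rows plus the quoted maximum.
durations = np.array([455, 663, 2124, 429, 435, 3526282], dtype=float)

# log1p computes log(1 + x), which stays finite even for a duration of 0.
log_durations = np.log1p(durations)

# expm1 is the exact inverse; use it to turn predictions back into seconds.
recovered = np.expm1(log_durations)

# The largest value expressed in days (86,400 seconds per day).
max_days = durations.max() / 86400
```

A model trained on `log_durations` predicts on the log scale, so its outputs need the `expm1` step before they can be read as seconds.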

Old size: 50000
New size: 49994

We then kept only fares greater than $0. I do not know the reason for sure, but I guess some trips were not charged.

Then I also checked the distribution of fares under a hundred dollars. The most common fare range, a little above 5 dollars, occurs around 6,000 times.

The distribution is skewed to the right, with counts descending as fares increase.
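The filtering step can be sketched as below. The toy fares are made up for illustration (the zero and negative values stand in for the uncharged trips), and the variable names are assumptions.

```python
import pandas as pd

# Made-up fares, including one zero and one negative value.
toy = pd.DataFrame({"fare_amount": [4.5, 16.9, 0.0, -2.5, 7.7, 120.0]})

old_size = len(toy)
# Keep only trips that were actually charged.
toy = toy[toy["fare_amount"] > 0]
new_size = len(toy)
print(f"Old size: {old_size}")
print(f"New size: {new_size}")

# Restrict to fares under 100 dollars for the histogram.
under_100 = toy[toy["fare_amount"] < 100]
```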

Fare Amount and Distance

This plot shows fare amounts under 100 USD. The bar chart focuses on understanding how trip fares are distributed. As we can see, the highest bar is above 6,000 trips.

Blue: Blue represents the overall trip distance, from 0 to more than 5000 miles. It shows the density from the center out to the edge of the city and the surrounding suburbs. The dark blue region from 0 to 15 miles is very dense. The few points on the right side of the picture, over 5000, might be out of town.

Green: There is a large dark green range. 0 miles means the city center. Fares range from low to high, with the majority below 60 dollars. The widest and densest range is from 0 to 6 miles, and some trips fall between 6 and more than 14 miles.

Why less than 15 miles: Because the density of trips within a 15-mile radius is so great, we wanted to see it more clearly. As the graph shows, fares below 100 dollars occur with high density and frequency, so the number of completed taxi rides in this range is also large.

[Plot output; y-axis: fare_amount]

Left picture: shows the full range of trip distances in miles.

Right picture: the pickup distance from the NYC center, from 0 to 8000 on the index, running from the center to outside New York.

The color bar is on the right side. There are a lot of purple-pink dots, which correspond to fares of about $50 to $60 at a trip distance of about 13 miles from the NYC center. This could be due to trips from/to JFK airport. The color bar shows the fare amount scale for the trips.

The plot shows how busy taxi trips are in the center. As we can observe, many dots are concentrated between 0 and 6 on the x-axis of the right picture. Short trips with a short pickup distance are located in the center of New York.

Further below, a plot shows how busy the trips in the NYC taxi data are.
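The dataset only stores coordinates, so a distance in miles has to be computed. One common choice is the haversine (great-circle) formula, sketched below; whether the notebook used exactly this formula is an assumption, and the function name and the Earth radius constant are illustrative.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius (assumed constant)


def haversine_miles(lon1, lat1, lon2, lat2):
    """Great-circle distance in miles between two (longitude, latitude) points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


# First row of the fare sample table: pickup and drop-off coordinates.
trip_miles = haversine_miles(-73.844311, 40.721319, -73.841610, 40.712278)
```

The same function can measure the distance from a fixed reference point, such as a chosen city-center coordinate, to each pickup.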

Relevance of direction for fare amount in dollars

Correlation between travel costs and direction.

At this point I am interested in the total number of passengers, whose trips are the main feature for predicting future prices. But what is the direction of the trips? To describe it, I made this plot of costs based on the longitude and latitude of trips in NYC. The color bar shows how expensive each trip is, from pick-up point to drop-off point.

The select-within-bounding-box function is used to select the pickup points by longitude and latitude. This function could be moved out of Jupyter into a package, but I keep it here. After choosing a suitable latitude and longitude range, I compute the average value, for transparency purposes, which makes the chart below clearer and the trips easier to see.
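A bounding-box selection of this kind might look like the sketch below. The function name, the tuple layout `(lon_min, lon_max, lat_min, lat_max)`, and the box values are assumptions for illustration; the rows are the first five from the fare sample table.

```python
import pandas as pd


def select_within_boundingbox(df: pd.DataFrame, bb) -> pd.Series:
    """Boolean mask of pickups inside bb = (lon_min, lon_max, lat_min, lat_max)."""
    lon_min, lon_max, lat_min, lat_max = bb
    return df["pickup_longitude"].between(lon_min, lon_max) & df[
        "pickup_latitude"
    ].between(lat_min, lat_max)


# First five rows of the fare sample table shown earlier.
toy = pd.DataFrame(
    {
        "pickup_longitude": [-73.844311, -74.016048, -73.982738, -73.987130, -73.968095],
        "pickup_latitude": [40.721319, 40.711303, 40.761270, 40.733143, 40.768008],
        "fare_amount": [4.5, 16.9, 5.7, 7.7, 5.3],
    }
)

BB = (-74.0, -73.9, 40.7, 40.8)  # hypothetical box, chosen only for this example
inside = select_within_boundingbox(toy, BB)

# Average fare of the trips that start inside the box.
mean_fare_inside = toy.loc[inside, "fare_amount"].mean()
```

Returning a mask instead of a filtered frame keeps the helper reusable: the same mask can select fares, coordinates, or any other column.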

The heat map shows the fare amount on a scale from 0 to 4. Purple: the fare amount is lower, not too high.

Orange + yellow: the fare amount is high because the taxi trip goes far from the center. Some of the trips went outside the center of this graph. The higher density of taxi trips is concentrated around the pickup latitude and pickup longitude coordinates. Finally, we can see the trips concentrate at a pickup longitude from -0.025 to 0.025, which means many customers booked taxis in the center. This is mainly the Manhattan district.

Outside, in orange, there are a few tiny dots from -0.075 to 0.075 in longitude and latitude. We can tell that these are far from the center, such as trips going out near JFK (Kennedy) Airport.

Understand vendor

Vendor accounts for a large part of the current New York dataset. There are two vendors, vendor 1 and vendor 2, so we analyze the number of passengers per vendor with a violin plot.

I used Seaborn specifically to plot it.

500000

vendor_id: there are two vendors in the database, vendor 1 and vendor 2.

For both vendor categories, the violin for passenger_count has a long, even tail around zero that extends even below zero. This is an effect of the violin plot's density smoothing, since a passenger count cannot be negative.

Trips of 5.0 to 7.0 miles with about 1 to 6 passengers happen regularly. Trips with 7 and 9 passengers are very rare.

Mapping Data

NYC TAXI Map Build Functions

This work follows a reference from the SciPy package, which gives us more options to zoom and plot the pickup and drop-off points. We can see where passengers start their trips, and that the center of Manhattan is extremely busy.

Explain the graph reason

The graph above shows the density of customers' pick-up points and destinations.

Yellow dots are drawn as individual point pixels, which lets users clearly distinguish built-up areas from the center to the suburbs. Because New York is a very big city, this is a useful perspective on the rides.

The purpose of this map is to be a starting point for analyzing customer trips and predicting the density of future trips.

It helps anticipate the places customers frequent most, for the most realistic results.

Then I split it into two minimaps showing the pick-up and drop-off locations. Predicting the density of where customers are going helps companies such as Uber and Lyft see where they will be busiest, with the aim of increasing revenue or optimizing their rides and service plans.

Cluster Trip:

Each trip usually has five main attributes: the pickup and dropoff locations (latitude and longitude for each) and the trip duration. Let's cluster all trips into up to 80 stereotypical template trips. Then we can look at the distribution of trips across clusters and find out how it changes over time.

(0.0, 81.0)

The cluster histogram shows how frequently each template trip occurs. I pooled all trips into 80 different clusters. Clusters 0 to 10 each contain more than 50,000 trips, the count goes down to about 30,000 around cluster index 20, and clusters far from the city, such as the 80th, contain fewer than 1,000. This means the clusters in the center are heavily concentrated.
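The clustering algorithm used in the notebook is not stated here, so the following is only a sketch of the general idea using a plain NumPy k-means (Lloyd's algorithm). The function name, the evenly spaced initial centers (a simplification that keeps the example deterministic), the synthetic five-feature data, and the use of 2 clusters instead of 80 are all assumptions for illustration.

```python
import numpy as np


def kmeans(points: np.ndarray, k: int, n_iter: int = 20):
    """Plain Lloyd's k-means; returns (centers, labels)."""
    # Simplified init: take k evenly spaced rows as the starting centers.
    start = np.linspace(0, len(points) - 1, k).astype(int)
    centers = points[start].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every center, shape (n, k).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    return centers, labels


# Synthetic stand-in for (pickup lat/lon, dropoff lat/lon, duration) features.
rng = np.random.default_rng(1)
blob_a = rng.normal(0.0, 0.1, size=(50, 5))
blob_b = rng.normal(5.0, 0.1, size=(50, 5))
trips = np.vstack([blob_a, blob_b])

centers, labels = kmeans(trips, k=2)

# Trips per cluster, the quantity shown in the cluster histogram.
cluster_sizes = np.bincount(labels, minlength=2)
```

On real trips the features would need scaling first, since duration in seconds and coordinates in degrees live on very different ranges.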

Trend of Trip Direction

I was also able to build a similar map in Tableau to simulate the rides and movements. You will see how I built it in Tableau using map layers. It is quite time-consuming to build. To help the company define real-time data operations, I attach a picture from Tableau below.

The view is from the sky, like from a helicopter, to understand how taxis go back and forth in the city.

Yellow dot: Represents the customer drop-off point.

Blue dot: Represents the customer pick-up point.

Light blue line: drawn with arrows showing the direction of the customer's trip.

The majority of customers move within the heart of NYC or travel out of the city to the airport. There are always rides with trip distances over 20 km, which appear in the suburbs outside the city, linked with the airport and its surroundings, and there are rides under 10 km.

There are also short trips away from the airport and the center, in areas where few customers travel, but with relatively low frequency.

The three busiest spots are central Manhattan, Kennedy Airport at the bottom right of the map, and the northeast edge.

Trips from John F. Kennedy Airport into the city vary widely, with the airport acting as a shuttle destination.

The pick-up points are shown in detail in this 3D chart, where I have also plotted the routes and the drop-off points. It is intended to show the trips more clearly without the many buildings and the complexity of the roads on the map, which helps us see and confirm the destinations.

Reason: This map is used to confirm a clear and easily identifiable drop-off point.

Blue points mark drop-off points and red marks the road map. When there are too many points, roads like these help the eye distinguish them more easily.

You can point at and move the map; it is a very user-friendly and eye-catching map.

After analyzing each time of the day and of the month, we can plot a map of the density of destinations and pick-up points.

Bright blue is where the pick-up points are located. We can see that the pick-up points are quite sparse in the center and spread out a little outside the center. At the airport, the number of customers is quite high and spreads evenly along the way back from the airport.

Bright blue means that the drop-off points are very dense, which seems to be higher than for the pick-up points. We can see that the density at the airport is also high. Beyond the edge of the center we can identify drop-off points that are also spread out, with more at the lower left edge of the map on the right.

Both gave us a specific perspective on the density of trips in NYC. We can then use algorithms to predict the customer's future drop-off points and the density of customer trips.

Conclusion:

To summarize what this NYC data analysis shows from several angles:

Fare amount perspective: We can observe the cost by day, night, month, and year. It gives us a specific view of the density of trips. From this, a data scientist can predict the hourly trip density in the future and find out which features affect trips. The predicted number of future trips is intended to help a driving service improve its cost-effectiveness.

Map perspective: As a Data Analyst student, I explored plotting densities with specific mapping techniques. Before that, I worked on data projects for company clients, so I am quite interested in improving my data skills. An attractive map draws readers in to look at the more specific details. Map features are demonstrated through various plotting techniques such as heat maps, scatterplots, and arrow plots.

Extra Tableau Report From PDF file.


Tableau gives you a better view of New York City from the air. It details how the graphs relate time to cost, with clear breakdowns for each crowded area in NYC. A Tableau dashboard that combines multiple graphs gives the company a holistic view.

                                                **END**